## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
This tidy dataset contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating from 0 (very bad) to 10 (very excellent).
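A structure listing like the one above, and a summary table like the one that follows, are the kind of output that `str()` and `summary()` produce. The sketch below is an illustration only: the report does not give the source file, so the data frame `wqr_head` (a hypothetical name) is rebuilt from the first ten values of four columns printed in the listing.

```r
# First ten rows of four columns, copied from the str() listing above.
# The full data set has 1599 rows; this small sample is for illustration only.
wqr_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

str(wqr_head)      # column types and leading values
summary(wqr_head)  # min, quartiles, median, mean, and max for each column
```

Because only ten rows are used, the numbers printed by this sketch will differ from the full-data summary.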
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The distribution of fixed acidity is positively skewed. The middle 50% of the wines have fixed acidity between 7.10 and 9.20.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The volatile acidity shows a bimodal distribution with positive skew. The middle 50% of the wines have volatile acidity between 0.39 and 0.64.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.120 7.680 8.445 8.847 9.740 16.285
Total acidity is composed of fixed and volatile acidity. The distribution of total acidity is positively skewed, with a median of 8.445.
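The derived variable can be sketched as a simple sum. This is an assumption rather than the confirmed computation, but it is consistent with the summaries: the mean fixed acidity (8.32) plus the mean volatile acidity (0.5278) is about 8.847, the reported mean of total acidity. The vectors below hold the first ten values from the `str()` listing.

```r
# First ten values of each column, copied from the str() listing.
fixed.acidity    <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
volatile.acidity <- c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50)

# Assumed definition: total acidity is the element-wise sum of the two.
total.acidity <- fixed.acidity + volatile.acidity

summary(total.acidity)
```

On the full data frame the same idea would read `wqr$total.acidity <- wqr$fixed.acidity + wqr$volatile.acidity`.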
The residual sugar values are concentrated at the low end, with a long right tail.
The chlorides are likewise concentrated at the low end, with a long right tail.
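A quick numeric check for a long tail toward high values is to compare the mean with the median. In the summary table the mean exceeds the median for both variables (2.539 vs 2.200 for residual sugar, 0.08747 vs 0.07900 for chlorides). The sketch below repeats that check on the first ten residual sugar values from the `str()` listing and shows how a `log10` transform compresses the tail; the transform is a suggestion, not a step taken in this report.

```r
# First ten residual sugar values, copied from the str() listing.
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# A mean above the median points to a long tail on the high side.
centre <- c(mean = mean(residual.sugar), median = median(residual.sugar))
print(centre)

# A log10 transform pulls in the large values (here the 6.1).
log_sugar <- log10(residual.sugar)
print(range(residual.sugar))
print(range(log_sugar))
```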
The total sulfur dioxide has some high outliers: the maximum is 289, while the third quartile is only 62.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
The middle 50% of the wines have a density between 0.9956 and 0.9978.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The middle 50% of the wines have a pH between 3.210 and 3.400.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
Most of the wines are rated 5 or 6 in quality.
There are 1,599 red wines in the dataset with 13 features (X, fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality). X identifies each wine, and quality represents how good the wine is. Strictly speaking, X is an unordered factor and quality is an ordered factor, but I treated both as numerical variables for convenience. All other features represent chemical properties of the wine.
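For reference, this is how quality could be represented as an ordered factor instead of a number. It is a sketch of the alternative, not what the analysis does; the levels 0 to 10 follow the rating scale described earlier, and the values are the first ten quality scores from the `str()` listing.

```r
# First ten quality scores, copied from the str() listing.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Ordered factor over the full 0-10 rating scale.
quality_f <- factor(quality, levels = 0:10, ordered = TRUE)

# Counts per rating, dropping the ratings that never occur in this sample.
print(table(droplevels(quality_f)))

# Ordered factors support comparisons between levels.
print(quality_f[4] > quality_f[1])

# Converting back to integers goes through the level labels.
quality_back <- as.integer(as.character(quality_f))
```

Treating quality as numeric, as done here, is what makes `cor()` and `lm()` directly applicable.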
Other observations:
The main feature in the data set is quality. I’d like to determine which features are best for predicting wine quality. I suspect that some combination of the other variables can be used to build a predictive model for it.
The primary wine characteristics are sweetness, acidity, tannin, alcohol, and body. Residual sugar, fixed and volatile acidity, alcohol, and density determine those characteristics. I expect these variables to be the ones most closely related to wine quality.
I created a variable for the total acidity using the volatile and the fixed acids.
Volatile acidity shows a bimodal distribution.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00 -0.26 0.67
## volatile.acidity -0.26 1.00 -0.55
## citric.acid 0.67 -0.55 1.00
## residual.sugar 0.11 0.00 0.14
## chlorides 0.09 0.06 0.20
## free.sulfur.dioxide -0.15 -0.01 -0.06
## total.sulfur.dioxide -0.11 0.08 0.04
## density 0.67 0.02 0.36
## pH -0.68 0.23 -0.54
## sulphates 0.18 -0.26 0.31
## alcohol -0.06 -0.20 0.11
## quality 0.12 -0.39 0.23
## total.acidity 0.99 -0.16 0.63
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.11 0.09 -0.15
## volatile.acidity 0.00 0.06 -0.01
## citric.acid 0.14 0.20 -0.06
## residual.sugar 1.00 0.06 0.19
## chlorides 0.06 1.00 0.01
## free.sulfur.dioxide 0.19 0.01 1.00
## total.sulfur.dioxide 0.20 0.05 0.67
## density 0.36 0.20 -0.02
## pH -0.09 -0.27 0.07
## sulphates 0.01 0.37 0.05
## alcohol 0.04 -0.22 -0.07
## quality 0.01 -0.13 -0.05
## total.acidity 0.12 0.10 -0.16
## total.sulfur.dioxide density pH sulphates alcohol
## fixed.acidity -0.11 0.67 -0.68 0.18 -0.06
## volatile.acidity 0.08 0.02 0.23 -0.26 -0.20
## citric.acid 0.04 0.36 -0.54 0.31 0.11
## residual.sugar 0.20 0.36 -0.09 0.01 0.04
## chlorides 0.05 0.20 -0.27 0.37 -0.22
## free.sulfur.dioxide 0.67 -0.02 0.07 0.05 -0.07
## total.sulfur.dioxide 1.00 0.07 -0.07 0.04 -0.21
## density 0.07 1.00 -0.34 0.15 -0.50
## pH -0.07 -0.34 1.00 -0.20 0.21
## sulphates 0.04 0.15 -0.20 1.00 0.09
## alcohol -0.21 -0.50 0.21 0.09 1.00
## quality -0.19 -0.17 -0.06 0.25 0.48
## total.acidity -0.11 0.68 -0.67 0.16 -0.08
## quality total.acidity
## fixed.acidity 0.12 0.99
## volatile.acidity -0.39 -0.16
## citric.acid 0.23 0.63
## residual.sugar 0.01 0.12
## chlorides -0.13 0.10
## free.sulfur.dioxide -0.05 -0.16
## total.sulfur.dioxide -0.19 -0.11
## density -0.17 0.68
## pH -0.06 -0.67
## sulphates 0.25 0.16
## alcohol 0.48 -0.08
## quality 1.00 0.09
## total.acidity 0.09 1.00
Fixed acidity has a strong positive correlation with citric acid (0.67), while volatile acidity has a moderate negative correlation with it (-0.55).
The pH has a strong negative correlation with fixed acidity (-0.68) and a moderate negative correlation with citric acid (-0.54), but only a weak positive correlation with volatile acidity (0.23).
Fixed acidity and alcohol have notable positive (0.67) and negative (-0.50) correlations with density, respectively.
Most of the variables do not seem to have strong correlations with quality, but alcohol and volatile acidity have moderate positive (0.48) and negative (-0.39) correlations with quality, respectively.
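A rounded correlation matrix of this kind can be produced with `cor()` and `round()`. The sketch below uses the first ten rows of five columns from the `str()` listing, so its values will not match the full-data matrix; `wqr_head` is a hypothetical name for that sample.

```r
# First ten rows of five columns, copied from the str() listing.
wqr_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlations between every pair of columns, rounded to 2 decimals.
cor_mat <- round(cor(wqr_head), 2)
print(cor_mat)

# A single pair can be read out by name.
print(cor_mat["fixed.acidity", "pH"])
```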
The strongest correlation among the original variables is between fixed acidity and pH. High acidity means low pH, and the graph is consistent with this fact.
Citric acid is one of the main components of fixed acidity. Therefore, the two variables have a strong positive correlation.
The fixed acidity has a strong positive correlation with density, too.
Yeast in wine converts citric acid to acetic acid, which makes up most of the volatile acid. Therefore, volatile acidity and citric acid are inversely related.
The citric acid has moderate negative correlations with volatile acidity and pH.
The alcohol and density also show moderate negative correlation.
Quality of wine tends to increase as volatile acidity decreases, because the main component of volatile acid is acetic acid which causes an unpleasant vinegar taste.
##
## Call:
## lm(formula = quality ~ volatile.acidity, data = wqr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.79071 -0.54411 -0.00687 0.47350 2.93148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.56575 0.05791 113.39 <2e-16 ***
## volatile.acidity -1.76144 0.10389 -16.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7437 on 1597 degrees of freedom
## Multiple R-squared: 0.1525, Adjusted R-squared: 0.152
## F-statistic: 287.4 on 1 and 1597 DF, p-value: < 2.2e-16
Based on the value of R-squared, volatile acidity explains only about 15.2% of the variance in wine quality.
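For a regression with a single predictor, R-squared equals the squared Pearson correlation, which ties this result to the correlation matrix: (-0.39)^2 is about 0.152. The sketch below checks the identity on the first ten rows from the `str()` listing; the fitted numbers differ from the full-data model, but the identity holds for any sample.

```r
# First ten rows of the two columns, copied from the str() listing.
volatile.acidity <- c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50)
quality          <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Simple linear regression of quality on volatile acidity.
fit <- lm(quality ~ volatile.acidity)

r2 <- summary(fit)$r.squared          # share of variance explained
r  <- cor(quality, volatile.acidity)  # Pearson correlation

# With one predictor, R-squared is the square of the correlation.
print(c(r.squared = r2, cor.squared = r^2))
```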
##
## Call:
## lm(formula = quality ~ I(sqrt(alcohol)), data = wqr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8551 -0.4087 -0.1711 0.5115 2.5870
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.0237 0.3538 -5.72 1.27e-08 ***
## I(sqrt(alcohol)) 2.3756 0.1096 21.68 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7101 on 1597 degrees of freedom
## Multiple R-squared: 0.2274, Adjusted R-squared: 0.2269
## F-statistic: 469.9 on 1 and 1597 DF, p-value: < 2.2e-16
Based on the value of R-squared, the square root of alcohol explains only about 22.7% of the variance in wine quality.
Residual sugar determines the sweetness of the wine. Most of the wines maintain a certain level of sweetness.
The quality correlates with alcohol and volatile acidity.
Citric acid is one of the main components of fixed acidity. As a result, they have a strong positive correlation.
High fixed acidity causes low pH. Therefore, fixed acidity and citric acid correlate negatively with pH.
Wine with more volatile acidity tends to have less citric acid.
Wine with more fixed acidity tends to be denser, while wine with more alcohol tends to be less dense.
The fixed acidity is positively and strongly correlated with citric acid and density. Citric acid might therefore stand in for fixed acidity and density in a model, possibly with an even better estimate of wine quality.
c(cor(wqr$volatile.acidity, wqr$sulphates),
cor(wqr$volatile.acidity, log10(wqr$sulphates)))
## [1] -0.2609867 -0.3005487
Transforming sulphates to log10(sulphates) strengthens the negative correlation between sulphates and volatile acidity, from -0.26 to -0.30.
c(cor(wqr$alcohol, wqr$pH), cor(wqr$alcohol, wqr$pH^7))
## [1] 0.2056325 0.2287039
Transforming pH to pH^7 slightly increases the correlation between pH and alcohol, from 0.21 to 0.23. As shown below, this slightly improves the fit of our model.
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wqr)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = wqr)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates,
## data = wqr)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## chlorides, data = wqr)
## m5: lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## chlorides + total.sulfur.dioxide, data = wqr)
## m6: lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## chlorides + total.sulfur.dioxide + pH, data = wqr)
## m7: lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## chlorides + total.sulfur.dioxide + pH + citric.acid, data = wqr)
##
## ==========================================================================================================================
## m1 m2 m3 m4 m5 m6 m7
## --------------------------------------------------------------------------------------------------------------------------
## (Intercept) 1.875*** 3.095*** 2.611*** 2.777*** 3.005*** 4.296*** 4.613***
## (0.175) (0.184) (0.196) (0.199) (0.204) (0.400) (0.461)
## alcohol 0.361*** 0.314*** 0.309*** 0.292*** 0.277*** 0.291*** 0.295***
## (0.017) (0.016) (0.016) (0.016) (0.016) (0.017) (0.017)
## volatile.acidity -1.384*** -1.221*** -1.167*** -1.142*** -1.038*** -1.115***
## (0.095) (0.097) (0.097) (0.097) (0.100) (0.115)
## sulphates 0.679*** 0.874*** 0.915*** 0.889*** 0.899***
## (0.101) (0.111) (0.110) (0.110) (0.110)
## chlorides -1.645*** -1.705*** -2.002*** -1.915***
## (0.394) (0.392) (0.398) (0.403)
## total.sulfur.dioxide -0.002*** -0.002*** -0.002***
## (0.001) (0.001) (0.001)
## pH -0.435*** -0.525***
## (0.116) (0.133)
## citric.acid -0.167
## (0.121)
## --------------------------------------------------------------------------------------------------------------------------
## R-squared 0.227 0.317 0.336 0.343 0.351 0.357 0.358
## adj. R-squared 0.226 0.316 0.335 0.341 0.349 0.355 0.355
## sigma 0.710 0.668 0.659 0.655 0.651 0.649 0.649
## F 468.267 370.379 268.912 208.125 172.683 147.427 126.712
## p 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1721.057 -1621.814 -1599.384 -1590.682 -1580.383 -1573.351 -1572.389
## Deviance 805.870 711.796 692.105 684.612 675.850 669.931 669.126
## AIC 3448.114 3251.628 3208.768 3193.364 3174.767 3162.701 3162.778
## BIC 3464.245 3273.136 3235.654 3225.626 3212.407 3205.719 3211.173
## N 1599 1599 1599 1599 1599 1599 1599
## ==========================================================================================================================
In the first trial, the linear model accounts for 35.7% of the variance in quality (m6). Adding citric acid (m7) is not significant and barely changes R-squared, so the less significant variables were left out.
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wqr)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = wqr)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + I(log10(sulphates)),
## data = wqr)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + I(log10(sulphates)) +
## chlorides, data = wqr)
## m5: lm(formula = quality ~ alcohol + volatile.acidity + I(log10(sulphates)) +
## chlorides + total.sulfur.dioxide, data = wqr)
## m6: lm(formula = quality ~ alcohol + volatile.acidity + I(log10(sulphates)) +
## chlorides + total.sulfur.dioxide + I(pH^7), data = wqr)
## m7: lm(formula = quality ~ alcohol + volatile.acidity + I(log10(sulphates)) +
## chlorides + total.sulfur.dioxide + I(pH^7) + citric.acid,
## data = wqr)
##
## ==========================================================================================================================
## m1 m2 m3 m4 m5 m6 m7
## --------------------------------------------------------------------------------------------------------------------------
## (Intercept) 1.875*** 3.095*** 3.369*** 3.742*** 3.998*** 4.003*** 4.099***
## (0.175) (0.184) (0.184) (0.201) (0.208) (0.207) (0.212)
## alcohol 0.361*** 0.314*** 0.303*** 0.285*** 0.270*** 0.289*** 0.295***
## (0.017) (0.016) (0.016) (0.016) (0.016) (0.017) (0.017)
## volatile.acidity -1.384*** -1.156*** -1.099*** -1.076*** -0.940*** -1.043***
## (0.095) (0.097) (0.098) (0.097) (0.101) (0.114)
## I(log10(sulphates)) 1.477*** 1.794*** 1.843*** 1.849*** 1.894***
## (0.177) (0.190) (0.189) (0.188) (0.190)
## chlorides -1.694*** -1.729*** -2.063*** -1.935***
## (0.383) (0.380) (0.385) (0.390)
## total.sulfur.dioxide -0.002*** -0.002*** -0.002***
## (0.001) (0.001) (0.001)
## I(pH^7) -0.000*** -0.000***
## (0.000) (0.000)
## citric.acid -0.228
## (0.118)
## --------------------------------------------------------------------------------------------------------------------------
## R-squared 0.227 0.317 0.345 0.353 0.361 0.370 0.371
## adj. R-squared 0.226 0.316 0.344 0.352 0.359 0.367 0.368
## sigma 0.710 0.668 0.654 0.650 0.646 0.642 0.642
## F 468.267 370.379 280.646 217.837 180.338 155.588 134.130
## p 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1721.057 -1621.814 -1587.752 -1577.984 -1568.023 -1557.699 -1555.809
## Deviance 805.870 711.796 682.108 673.825 665.482 656.943 655.393
## AIC 3448.114 3251.628 3185.503 3167.967 3150.046 3131.397 3129.619
## BIC 3464.245 3273.136 3212.389 3200.230 3187.686 3174.414 3178.013
## N 1599 1599 1599 1599 1599 1599 1599
## ==========================================================================================================================
The variables in this linear model account for 37.0% of the variance in wine quality (m6). Using log10(sulphates) and pH^7 improves on the 35.7% obtained without transformation.
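The stepwise comparison can be sketched in base R by growing a model with `update()` and collecting fit statistics. This is not the code that produced the tables above; it uses only the first ten rows from the `str()` listing and stops at three models, and the names `d`, `models`, and `comparison` are hypothetical. Note that `log10(sulphates)` works in a formula with or without `I()`, whereas `I()` is needed for `pH^7` because `^` has a special meaning in formulas.

```r
# First ten rows of four columns, copied from the str() listing.
d <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Nested models: each one adds a predictor to the previous one.
m1 <- lm(quality ~ alcohol, data = d)
m2 <- update(m1, . ~ . + volatile.acidity)
m3 <- update(m2, . ~ . + log10(sulphates))
models <- list(m1 = m1, m2 = m2, m3 = m3)

# One row per model with the statistics reported in the tables.
comparison <- data.frame(
  r.squared     = sapply(models, function(m) summary(m)$r.squared),
  adj.r.squared = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC           = sapply(models, AIC),
  BIC           = sapply(models, BIC)
)
print(round(comparison, 3))
```

R-squared can only stay the same or rise as predictors are added, which is why adjusted R-squared, AIC, and BIC are the more useful columns for deciding whether an extra term is worth keeping.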
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + I(log10(sulphates)) +
## chlorides + total.sulfur.dioxide + I(pH^7) + citric.acid,
## data = wqr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.63753 -0.37786 -0.03801 0.44159 1.96876
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.099e+00 2.123e-01 19.308 < 2e-16 ***
## alcohol 2.948e-01 1.708e-02 17.260 < 2e-16 ***
## volatile.acidity -1.043e+00 1.138e-01 -9.161 < 2e-16 ***
## I(log10(sulphates)) 1.894e+00 1.895e-01 9.995 < 2e-16 ***
## chlorides -1.935e+00 3.905e-01 -4.954 8.04e-07 ***
## total.sulfur.dioxide -2.207e-03 5.023e-04 -4.394 1.18e-05 ***
## I(pH^7) -6.244e-05 1.265e-05 -4.936 8.83e-07 ***
## citric.acid -2.281e-01 1.176e-01 -1.940 0.0525 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6418 on 1591 degrees of freedom
## Multiple R-squared: 0.3711, Adjusted R-squared: 0.3684
## F-statistic: 134.1 on 7 and 1591 DF, p-value: < 2.2e-16
Transforming sulphates and pH increases their correlations with other variables. These transformations give a clue for building a better linear model.
High alcohol and low volatile acidity contents seem to produce better wines.
I created a couple of linear models. Though the R-squared of the model could be increased a bit by transforming a couple of variables, the final model is still not satisfactory. This may be because our dataset contains a small number of observations. Furthermore, most of the observations are middle-quality wines, which makes it difficult for the model to predict the edge cases. A supplementary dataset with more edge cases might help to predict wine quality more accurately.
Alcohol percentage plays a primary role in determining the quality of wines. The higher the alcohol percentage, the better the wine quality tends to be. However, the R-squared value from our linear model shows that alcohol alone explains only about 22% of the variance in wine quality, so alcohol is not the only factor responsible for better wine quality.
The volatile acidity has a negative relation with wine quality, though it is weaker than that of alcohol. A likely reason is that the main component of volatile acid is acetic acid, which causes an unpleasant vinegar taste.
We can see that the model fails to predict the good and bad quality wines. This is because most of the observations are ‘average’ quality wines and there are insufficient observations in the extreme range. The model accounts for only about 37.1% of the variance in quality.
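One way to see where predictions miss is to round the fitted values to the nearest rating and cross-tabulate them against the actual ratings. This is a sketch of that idea, not the procedure behind the plot discussed here; it fits a two-predictor model to the first ten rows from the `str()` listing, and `d`, `predicted`, and `confusion` are hypothetical names.

```r
# First ten rows of three columns, copied from the str() listing.
d <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

fit <- lm(quality ~ alcohol + volatile.acidity, data = d)

# Round the continuous predictions to the nearest whole rating.
predicted <- round(fitted(fit))

# Rows are actual ratings, columns are predicted ratings.
confusion <- table(actual = d$quality, predicted = predicted)
print(confusion)

# Share of wines whose rounded prediction equals the actual rating.
hit_rate <- mean(predicted == d$quality)
print(hit_rate)
```

On the full data, sparse rows for ratings such as 3 or 8 would show directly how few extreme wines the model has to learn from.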
The data analyzed in this project contains the chemical properties and quality information for 1,599 red wines. Based on the statistics of the chemical properties of the wines, I tried to establish a model to predict the quality of each red wine.
Some of the ingredients showed strong correlations with others, and their relationships could be explained chemically. Alcohol and volatile acidity were directly correlated with the quality of the wine, and these characteristics helped to establish a quality prediction model. Transforming sulphates and pH strengthened their correlations with other variables, and using the transformed variables raised the R-squared of the quality model to some extent.
The resulting model had a low prediction success rate because most of the wines in the data had a quality of 5 or 6 and the number of samples for other qualities was not sufficient.